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Abstract. Models of fermions interacting with classical degrees of freedom are applied to a 
large variety of systems in condensed matter physics. For this class of models, Weifie [Phys. Rev. 
Lett. 102, 150604 (2009)] has recently proposed a very efficient numerical method, called O(N) 
Green-Function-Based Monte Carlo (GFMC) method, where a kernel polynomial expansion 
technique is used to avoid the full numerical diagonalization of the fermion Hamiltonian matrix 
of size N, which usually costs 0(N 3 ) computational complexity. Motivated by this background, 
in this paper we apply the GFMC method to the double exchange model in three spatial 
dimensions. We mainly focus on the implementation of GFMC method using both MPI on a 
CPU-based cluster and Nvidia's Compute Unified Device Architecture (CUDA) programming 
techniques on a CPU-based (Graphics Processing Unit based) cluster. The time complexity of 
the algorithm and the parallel implementation details on the clusters are discussed. We also 
show the performance scaling for increasing Hamiltonian matrix size and increasing number of 
nodes, respectively. The performance evaluation indicates that for a 32 3 Hamiltonian a single 
GPU shows higher performance equivalent to more than 30 CPU cores parallelized using MPI. 



1. Introduction 

In solid state materials, we observe quite different physical properties such as superconductivity 
and magnetism. These different properties are explained by different behavior of electrons, which 



in turn is caused mostly by Coulomb interactions among electrons or/and interactions between 
electrons and other degrees of freedom such as phonon. Therefore, theoretical understanding 
of these interacting electrons is crucial to, e.g., design new materials with rich functionalities. 
However, precisely due to this many-body nature of these interactions, it is usually very difficult 
to treat these systems analytically and even numerically. 

Among such interacting electron systems, models of electrons interacting with classical 
degrees of freedom can be applied to a large variety of systems. One of the most studied systems 
is the double exchange model for colossal magnetoregistive manganese oxides P, 0, 0, 0). In the 
double exchange model, electrons can move around a lattice site under the influence of the 
localized classical spins via Hund's rule coupling, the Hamiltonian being given in Section [2J 
To solve this model, the Monte Carlo method is commonly used to carry out the importance 
sampling for the classical degrees of freedom after treating electron degrees of freedom by 
numerically diagonalizing the electron Hamiltonian of dimension N (where iV is a system 
size) P, El- However, since at every Monte Carlo step one must fully diagonalize the 
Hamiltonian matrix to evaluate the partition function for a given configuration of the classical 
spins, it usually costs 0(iV 3 ) numerical complexity. Because of this 0(iV 3 ) scaling of the 
calculations, the accessible system size is strictly limited to up to hundreds to thousands 
sites, preventing us from studying important properties such as the Curie temperature of a 
ferromagnetic ordering [H, 0, 0] . 

In order to reduce the numerical complexity, Kernel Polynomial Method (KPM) is very 
advantageous [jj. Indeed, Motome and Furukawa have applied a Chebyshev expansion of the 
electron density of states for evaluating the partition function, which can reduce the numerical 
complexity down to 0(N 2 ) 0]. They have also proposed that the numerical complexity can be 
further reduced to 0(N) by truncating the moment calculations for the Chebyshev expansion 0]. 
Very recently, Weifie has pointed out that the same numerical complexity of O(iV) can be 
achieved by using a Green-function-based Monte Carlo (GFMC) method, instead of directly 
evaluating the partition function for a given spin configuration 0]. The GFMC method 
dramatically decreases the computational time by using only a few local Green's functions 
estimated by the Chebyshev expansion to calculate the change of the electron density of states. 
Moreover, this O(N) GFMC method is more attractive than the one for directly evaluating the 
partition function with the same complexity of O(N) 0] because the GFMC method achieve 
the O(iV) complexity without any truncation 0]. 

Motivated by these recent progress, here we apply the O(N) GFMC method to the double 
exchange model. In this paper, we mainly focus on the implementation of the GFMC method 
using both MPI on a Central Processing Unit (CPU) cluster and Nvidia's Compute Unified 
Device Architecture (CUDA) programming techniques on a Graphics Processing Unit (GPU) 
cluster 0]. The time complexity of the algorithm is estimated. The implementation details on 
a GPU cluster is also described since programming on a GPU cluster requires more attention 
on hardware aspects such as memory copy and communication between CPU and GPU. The 
performance evaluation indicates that the GPU program can archive over 10 x speedup as 
compared to the MPI parallelization on a CPU cluster. 

This paper is organized as follows: Section [2] explains the double exchange model and 
formulates the GFMC method, Section [3] describes the implementation details and the 
parallelization techniques, which are tested in Section [H Section [5] shows the performance on a 
single CPU processor as well as the whole cluster, and finally, Section [6] will give the conclusions. 



2. Model and Method Formulation 

2.1. Double Exchange Model 

The double exchange (DE) model describes electrons interacting the classical spins via Hund's 
rule coupling. The Hamiltonian is given by 

H = -t ( c L c io + H - c -) +JhJ2&- c \ a °*pc lP , (i) 

where c] CT is a creation operator of electron at site i and with spin a = (t)i)j (hj) runs over 
a pair of nearest neighbor sites i and j, Si is the classical spin at site i with its normalization 
\Si\ = 1, and Jh denotes the Hund's rule coupling with Jh > 0. In the limit of infinite Jh 
(Jh — > oo), the Hamiltonian becomes 

# = -E%( c L^ + H.c.) (2) 

<ij) 

with the hopping amplitude 
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where 6i and (f>j denote the polar and the azimuthal angles of the classical spin Si, respectively. 
The grand partition function of the DE model is written as 

N f ~ r - i 

Z = HJ d3 ^ Tr ° [eM-PH({Si}) - fiN\ , (5) 

i 

where (3 = 1/T is inverse of temperature T, /x is the chemical potential, N is the total number 
operator of electrons, and Tr e [- • • ] indicates the trace over the electron degrees of freedom in the 
Fock space. The trace over classical spin degrees of freedom, i.e., flf J d 3 SiP({Si}), is evaluated 
by the Monte Carlo importance sampling with its weight P({Si}) for a given spin configuration 
{Si}, 



P({Si}) = Tr c exp(-^({5i} - fiN) = exp[-S off ({5 i })], (6) 



where 



N 

5 eff ({5 i }) = -Elog(l + e-^«^»-^) (7) 

i 

= -J log(l + e-^ E -^)p(E)dE, (8) 

ei is the i-th eigenvalue of Hamiltonian H({Si}), and p(E) is the electron density of state 
(DOS) for a given spin configuration {Si}. Using the Metropolis algorithm, the possibility 
P({Si} — > {S^}) of accepting a new spin configuration {S^} is given by 

p(IS-\ (S'\) = P ^ S '^ = e -S e fi({^})+ s aff({^}) 
P({Si}) 



Therefore, the quantity A se fj define by 



A scfr = S ef j({S*}) - S eS ({Si}) = - J log(l + e-^ E -A){p'{E) - P {E))dE (9) 

is all we need to carry out the Monte Carlo calculation. Here, p'{E) is DOS for a given spin 
configuration {S^}. As mentioned in Section[Q we can exactly diagonalize H({Si}) with 0(iV 3 ) 
complexity to evaluate A sc fj P, 0, El], or we can use Kernel Polynomial Method (KPM) with 
0(N 2 ) or 0(N) complexity to estimate the DOS directly @, 0]. However, the latter one with 
O(N) complexity must truncate the moment calculations. Here, we use the recently proposed 
GFMC method [8|], as will be explained below for completeness. 



2.2. Green- function-based Monte Carlo (GFMC) method 

In this 0{N) GFMC method, we need to evaluate only several elements of the Green's function 
to calculate A se g- given in Eq. ([9]), provided that the change of spin configuration {Si} — > {S^} 
is local, i.e, only a single spin at site i is changed with the rest of spins unaltered. We first give 
the definition of the Green's function G(z) in the complex plane: 

G(z) = j^— z , z = E + ie, (10) 

where E is real and e is a very small positive real. The Hamiltonian for a given spin configuration 
{S{} ({S'j}) is denoted by H (H'), and the Hamiltonian difference A is thus A = H' — H. 
The determinate of G(z)(H' — z), i.e., 



d{z) := Det [G(z)(H' - z)] = Det [G(z)(H - z) + G{z)A] = Det [1 + G{z)A] (11) 

N -, N 

= Det [G(z)\ [(H> -z)]=H —— l[e>-z, (12) 



has a special role in the GFMC method, where q and are the i-th eigenvalues of H and H', 
respectively. It is readily shown that 

— lim Im ^ - — = — lim Im ( — ) 

7T e^O dz IT e^O £j — Z ^ €■ - Z J 

N N 

= Y d 8(E-e^-J2S(E-e i ) 

i i 

= p(E) - p\E). (13) 

Thus, this equation in the left hand side can be used in Eq. ([9]), and A se g is now described using 
d(z): 

A scff = 1 limlm / log(l + e-^ f l ° g( f {z)) dE 

7T e^O J dz 

=U i+^-^ ha]oe{d{z))dE - (14) 

It is important to notice that we need only several elements of G(z) to evaluate d(z) = 
Det [1 + G(z)A]. As an example, here we consider the simple cubic lattice with the nearest 
neighbor hopping, and we assume that only one spin at site o is changed: S Q — > S' Q . In the cubic 



lattice, site o has 6 nearest neighbors (NN) denoted by {n, e, s, w, t, b}. Then, the A matrix has 
a very simple form of 
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Therefore, to evaluate d{z), only the following 7x7 Green's functions have to be calculated 



G 



Gn,n 


Gn 7 e 


Gn,s 


Gn,o 


Gn,w 


G n ,t 


G n ,b 


Ge 7 n 


G e ,e 


G e ,s 


G e ,o 


Ge,w 


G e ,t 


G e ,b 


Gs : n 


G s ,e 


G S: s 


G s ,o 


Gs 7 w 


G Si t 


G s ,b 


Go,n 


Go^e 


G 0i s 


Go,o 


Gq^w 


G ,t 


G ,b 


Gw,n 


G w ^ e 


G w ^ s 


G-w,o 


Gw 7 W 


G Wi t 


G w ,b 


Gt, n 


Gt : e 


Gt, 3 


Gt,o 


Gt,w 


Gt : t 


Gt,b 


Gb,n 


Gb,e 


Gb, s 


Gb,o 


Gb,w 


Gb,t 


Gb,b 



(16) 



The computation can be further simplified by expanding d{z) as the following: 



d{z) = det(l + G(z)A) (17) 

= [1+ ]T A jo G oj (z)][l+ ]T A oj G jo (z)} (18) 
jeNN jeNN 

-G 00 [ J2 ^oA ok G kj (z)}, (19) 

j,keNN 

where 

A jo = (j\A\o),G oj = (o\G\j). (20) 
Moreover, using the following state \v): 

\v)=A\o)= £ A io |j), (21) 

jeNN 

d(z) can be compactly expressed as 

d{z) = [1 + G ov (z)][l + G vo (z)\ - G 00 (z)G vv (z). (22) 

Notice that we now need only a 2 x 2 Green's function to evaluate d{z). 

Now, a question is how to calculate efficiently the local Green's functions G(z). For this 
purpose, we use the KPM [jj, which can be efficiently implemented in a GPU cluster [l(J. Using 
two types of Chebyshev polynomials (m: integer), 

T m (x) = cos [marccos(x)] 

sin [(m + l)arccos(x)] , (23) 



U m {x) 
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the diagonal elements of the Green's function are expanded as 
Ga(w + ie 
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(24) 



Since the Chebyshev polynomials T m (x) and U m (x) requires that the argument x should be 
within [—1, 1], we must renormalize the energy spectrum E to w. The fl m represents the m-th 
moment (defined below) after applying a kernel function, fi m = fi m gm, where g m is the kernel 
function to eliminate Gibbs oscillations [Bj]. Here, we apply Lorenz kernel function which is 
defined as g m = sinh[A(l — m/M)]/sinh(A) with appropriate choice of A [jj. The m-th moment 
fj, m is defined as 

Mm = — nm I m / Ga(w + ie)T m (w)dw 

7T e-»0 J 

^2(i\n)(n\i)5(w - e n )T m (w)dw 

n 

= ^2(i\n)(n\i)T m (e n )dE 
n 

= J2(i\T m (H)\n)(n\i) 

n 

= (i\T m (H)\i), (25) 

where H is the renormalized Hamiltonian to fit the spectra within [—1,1]. 

Defining \a m ) = T m (H)\i) and thus fi m = (i\a m ), the Chebyshev series in Eq. ([23]) yeilds the 
recursive relation of these vectors \a m ), namely, 




(26) 
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In addition, the number M of coefficients is obtained with M/2 iterations if the following 
equation is used, 
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When the coefficients [i m are all calculated, we can evaluate the Green's function using fast 
Fourier transform [jj. In Eq. (|24p . if we choose 

^ + ^ } (jfe = 0,l,..,M-l) 



w = cos 



the Green's function becomes 

Gu(w + ie) 
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where 
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Let us now denote the summation part in Eq. (|29p by Xk'- 
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It should be recalled that the following expression is required for the fast Fourier 
transformation 111 ]: 
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Using the following correspondence between Xj and 7^- 



f X2j = 7j, 



j = 0,l,...,M/2-l, 



■i-i 



can be evaluated using the fast Fourier transformation, and thus the time complexity of 
calculating Eq. (|29|) reduces to 0(Mlog(M)), where M is the number of moments kept. 

The off diagonal elements of the Green's function, G ov and G vo , can be evaluated similarly 
if we use the follow mixed elements of the Green's function (note that i below is the imaginary 
unit): 



G -\-v,o+v = ((v\ + (o\)G(\v) + \o)) = G 00 + G vv + G ov + G vo 
G 0+ i Vl o+iv = (-i(v\ + (o\)G(\o) + i\v)) = G 00 + G vv + iG ov - iG vo , 

G ov and G vo are now expressed by the diagonal elements of the Green's functions: 

Gov — 2 [Go+v,o+v G o G V v i{.G 00 + G V v G -\-iv ,o-\-iv)\ 1 
Gvo — 2 [Go+v,o+v G o G V v i\G -\-iv,o J riv G Q o G vv )\. 



(33) 
(34) 
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(36) 



Using these four elements of the 2x2 Green's function, we can readily calculate d(z) and thus 
A sc ff can be evaluated. 

Finally, it should be noted that if Hamiltonian matrix H is stored in a compression format, 
the time complexity of Eq. (|26p is 0{NM), where iV is the dimension of H. As mentioned above, 
the time complexity to calculate Eq. ()24|) is 0(M log (M)), and thus the total time complexity 
is expected to be O(iVM) + 0(M log(M)). When M is fixed, the time complexity should scale 
linearly with the dimension of H. Indeed, we find, as shown in Fig. [lj that the execution time 
is approximately proportional to the Hamiltonian size TV. 
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Figure 1. Hamiltonian size dependence of the execution time for (a) the exact diagonalization 
method P, 0, S] and (b)the GFMC method. Here, for simplicity, we performed only 10 Monte 
Carlo trial flips of spins, and M is fixed for different Hamiltonian sizes. We can clearly see 
that the time consumption for the exact diagnolization method roughly follow the complexity 
of 0(N 3 ), but for the GFMC method it is almost linear. 

3. Implementation and Parallelization Schemes 

3.1. Algorithms 

For a given temperature T, we use Algorithm [T] to calculate the magnetization M for the classical 
spins through the average over S Monte Carlo sweeps (the loop from line[T]), where each sweep 
corresponds to N spin trail flips (the loop from line [2]) ■ Since the direction of the magnetization 
is trivial, the magnetization M is defined here as the length of the total spin vector. 

In the implementation, we apply the Metropolis method (line [TU] in the Algorithm [TD to 
determine whether a trail flip is accepted by comparing with a random number between and 
1. 

Section[2]has illustrated how to calculate the A se g (from line|3]to[9]) using the GFMC method. 
Especially, the calculation of the expansion coefficients /i m plays a curial role (line [6]) . This part 
occupies most of the execution time since the recursion [denoted by Eq. (|26p ] involves intensive 
matrix- vector multiplication (MVM) with complexity of O(N). Note, however, that GPU is 
an ideal platform to parallelize MVM, because the multiplications between the rows and the 
vector could be distributed to hundreds of streaming processors. Therefore, we focus on a GPU 
implementation with the highly parallelism and expect a large speedup factor as compared with 
the CPU one. 

3.2. Parallelization Frameworks 

The magnetization M as a function of temperature T is obtained through the average over a 
large number of Monte Carlo sweeps. If the system stays in the thermal equilibrium, the Monte 
Carlo sweeps are independent with each other. Therefore, we could divide the outside loop 
S (line [T]) into many threads. However, the warm up sweeps, which are necessary to achieve 
the thermal equilibrium, should be abandoned for taking the average, and this condition may 
prevent us from dividing S to too many threads because some threads may have too few sweeps 
to reach thermal equilibrium. 

In addition to this parallelism for the Monte Carlo sampling, the intensive matrix-vector 
and vector-vector multiplications in line [6] can be effectively parallelized into multi-core CPU 
processors using OpenMP or many-core GPU processors using CUDA. In this paper, we focus 
on the implementation based on GPU to archive higher parallelism than multi-core CPU. 

In our algorithm, we combine two parallelizations to achieve high efficiency, i.e., the Monte 
Carlo sweeps are divided into several MPI threads while in each thread we employ GPU to 
calculate matrix-vector operations. 



Algorithm 1 Calculate the magnetization as a function of temperature 



Require: Integer S to represent number of Monte Carlo sampling sweeps 
Require: Hamiltonian matrix H of dimension N x N 
Require: Scalar M to store accumulated magnetization 

1: for i = 1 — > S do 

2: for j = 1 ->■ AT do 

3: Randomly choose site o and change the spin randomly 

4: Calculate the modification matrix A <— H — H' using Eq. ([3]) 

5: Get vector it <— A 77" 

6: Calculate the coefficients jlm , firn , fim rlv \ fim applied Lorentz kernel function g m , 

where jlm ) = (V\T m (H)\ V)g m 
7: Calculate 4 elements of the Green's function G 00 ,G vv ,G 0+ i V)0+ i v and G -i Vt0 -i v using 

Eq. ([21 

8: Calculate using Eq. ([22]) 

9: Calculate A se g- using Eq. (Q3D 
10: if e~ Ascff > rand() then 
11: Accept the new spin configuration. 

12: Update H : H <- il + A 

13: end if 
14: end for 

IV* s-l 

15: M = M+ l^jy 'I 
16: end for 

17: Magnetization «— M/S 1 



5.5. Implementation on GPU 

3.3.1. Compression Format of Hamiltonian Matrix Because the Hamiltonian matrix may be 
very large, e.g. 20 3 x 20 3 , an effective compression technique must be introduced to reduce not 
only the memory consumption but also unnecessary memory access. The compression storage 
format of the Hamiltonian matrix may varies with different lattices and different physics models. 
For the simple cubic lattice used in this paper, each site has 6 nearest neighbors that lead to 
6 non-zero entries in each row. This case is very suitable for ELLPack format because no 
extra data need to be padded to the end of the rows. ELLPack format needs two matrices to 
store the non-zero entries and column indices, respectively. On GPU device, the two matrices 
are stored in column major to trigger coalesced memory access. 

3.3.2. Arrangement of CUDA Kernel Functions One important issue that should be well 
addressed is the communication between GPU and CPU. Since the Hamiltonian matrix and 
the spin configurations may be updated after each iteration, if the Hamiltonian matrix are 
stored in CPU memory, the data transfer between CPU and GPU may significantly decrease the 
performance. This is especially true when the lattice size is small because the execution time for 
numerical calculation occupies a low percentage of the overall time consumption. According to 
our test, the simulation of 6 x 6 x 6 cubic lattice on a 8x PCIe Tesla C2050 is about 30% slower 
than on a 16x PCIe C2050. Therefore, the data transfer may not only decrease the performance, 
but also make the program to be more dependent on the bandwidth between the CPU and GPU. 

To avoid the memory transfer between CPU and GPU as much as possible, in our 
implementation the matrix H and the spin configurations are kept in GPU memory. 

With these considerations, we propose a multi- kernel GPU implemenation shown in the 
Figure [21 in which the kernel functions are indicated by "<<<>>>" and arranged as the 



for i 


= 1^5 


for 


j = 1 N 




II choose a spin randomly and flip a spin with a random direction 




// and calculate matrix A 




FlipSpinAndCalculateDelta <<<>>> (); 




// prepare vectors \o) , \v) , \o) + i\v) , \o) — i\v) 




PrepareVector <<<>>> (); 




// calculate the expansion coefficients for each Green' s function 




for each vector in { |o), \v), \o) + i\v), \o) — i \v) } 




for m=0 to M/2 




// cacluate 2 coefficients for each iteration 




CalculateCoef f icient<<<>>> (); 




end for 




end for 




// transform Eq. (12411 to meet the requirement of FFT 




PrepareForFFT<<<>>> (); 




// perform FFT to obtain four Green' s functions 




Perf ormsFFT<«»> (); 




// Calculate A se ff 




CalculateDSEFF<<<>>> (); 




// If a trial spin flip is to be accepted, update the status 




UpdateStatus<<<>>> (); 




// Calculate the magnetization 




MeasureSpin<<<>>> (); 


end 


for 


end for 





Figure 2. Pseudocode of GPU implementation 



following: 

Function "FlipSpinAndCalculateDelta" flips a spin with a random direction and calculates 
the Hamiltonian matrix difference A, which is used in the function "PrepareVector" to create 
four vectors {|o), \v), \o) + i\v), \o) — i\v)}. For each vector, the function " CalculateCoefficient" 
calculates two expansion coefficients /x m in every iteration. After all the coefficients are ready, 
function " PrepareForFFT" prepares the data for fast Fourier transformation (FFT), including 
applying Lorentz kernel function and transform Eq. (|24p to meet the requirement of FFT. After 
performing FFT by function "PerformsFFT" , we obtain A sc g using function " CalculateDSEFF" . 
If the flip is to be accepted, function "UpdateStatus" will update the spin configuration and 
the Hamiltonian matrix. After one MC sweep finishes, function "MeasureSpin" accumulates the 
spins to calculate the magnetization. 

3.3.3. Implementation of the Recursion The most important kernel function is "CalculateCo- 
efficient", which involves the most intensive computation due to the recursive matrix- vector 
operations. Figure [3] demonstrates the implementation of this CUDA kernel function to calcu- 
late the coefficients [iim- In CUDA architecture, the parallel running threads are divided into 
several thread groups, called "Blocks". For example, in Figure [3] there are p thread blocks and 
each block has 4 threads, indexed from to 3. Each thread performs the multiplication between 
a row and the vector |a m _i) that is stored in a ID array Rl. S ince the produced vector 
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Figure 3. Implementation of function " CalculateCoefficient" to calculate the moments \iir. 



could be placed on RO overwriting the vector |o m _2), we only need two ID memory arrays, RO 
and Rl, to perform calculations by exchanging their pointers after each iteration. 

After the vector \a m ) is ready, a production (a m \a m ) is performed, shared memory is applied 
to store the product result. (a m \a m ) is a dot product, for CUDA, it is a reduction problem, 
which will be carried out in two steps, first, the product result in each block is accumulated to 
produce a scalar and then the scalar in each block is further accumulated to get coefficient H2m 
using Eq. [28j 

This implementation could eliminate the most data transfers during the calculation. The 
bandwidth test shows the performance difference between a 8x and 16x PCIe based Tesla C2050 
is very small (less than 6%) while calculating a 6 3 cubic lattice. 

4. Validation 

In order to check the availability of our implementation, The DE model on the simple cubic 
lattice is simulated. The magnetization M as a function of temperature T is examined and the 
results are shown in Figure [H The simulations are performed with the following parameters: 
Chebyshev truncation number is 256, chemical potential \x is 0, the average is taken over 5000 
MC sweeps. 

5. Performance evaluation 

5.1. Experimental Setup 

The performance evaluation is performed on a cluster system that has 16 nodes of the NEC 
LX series connected with 40GByte/sec Infiniband network via the switch. Each node has two 
NVIDIA Tesla M2050 GPUs connected to the separate PCI Express buses and 12 CPU cores 
(Xeon E5645 2.4GHz). The OS of the node is CentOS 6 and the graphics driver with CUDA 4.0 
is installed. The compiler for CPU is gcc-4.4.4. Two CPUs are tightly coupled with sharing the 
memory bus (totally 12 CPU cores in a node). Amount of total memory is 12GBytes per node. 
All experiments performed below use the compiler option "-03" for CPU and GPU programs. 
A head node of the cluster is also connected to the cluster via the Infiniband network to serve 
the home file system in order to share the program resources. 

We evaluate the performance in two ways. The first is to measure the performance of a single 
CPU core and a single GPU with different Hamiltonian matrix sizes. The second is to evaluate 




Figure 4. Magnetization M as a function of temperature T for different sizes (indicated in the 
figure) of the simple cubic lattice. 




the overall performance comparison between all CPU cores and all GPUs. 
5.2. Performance Scaling on Multi-core CPU 

Section [3.11 has explained that the process of calculating coefficients \x involves intensive MVM 
operations. The performance of MVM largely depends on the memory access speed. Therefore, 
the memory bandwidth plays an essential role in the overall performance. This conclusion is 
consistent with our mutli-core CPU performance scaling test shown in Figure [5j 

This test is performed on a single node that has 2 CPUs, totally 12 cores and 12GB DDR2 800 
memory with a peak bandwidth of 12.5GB/s. In order to evaluate the parallelization efficiency, 
we examined the quantity expressed by y = execution time x nprocs, in which nprocs represents 
the number of MPI threads. As the outside loop [from line 1] in Algorithm [1] is divided into 
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Figure 6. Performance scaling with Hamiltonian matrix size on cluster, the result is obtained 
with 38400 trail flips calculated and Chebyshev truncation number kept as 256 

mutliple MPI threads and there is very little communications among these threads, in an ideal 
case of parallelization, y should be a constant. However, we have noticed that when the threads 
number becomes larger than 4, y begins to increase fast, suggesting that the efficiency become 
to decrease rapidly. 

It indicates that the memory bandwidth becomes saturated when nprocs grows to be larger 
than 4. Therefore, we can conclude that the bottleneck of this method is memory bandwidth. 
In this paper we resort to solve the problem using Tesla C2050 GPU with maximum bandwidth 
of 144GB/s 0. 

5. 3. Performance Scaling for Increasing Hamiltonian Size on Cluster 

Figure [6] shows the performance scaling on the cluster while the Hamiltonian size is increasing 
from 6 3 to 32 3 . 38400 Monte Carlo trail flips are calculated while the number of Chebyshev 
moments are kept as 256. It can be observed that when the H matrix is small, e.g. 6 3 or 8 3 , 
192 CPU cores are much faster than 32 GPUs. To explain the reason, let us divide the total 
time consumption of GPU program into two parts: a) the time for numerical calculation and b) 
the time for other works such as initializing GPU device, copying memory between CPU and 
GPU, and MPI communication, etc. Usually the latter part consumes very little time, but when 
the matrix size is small, comparing with the first part, time consumption of the latter part can 
not be neglected and significantly decrease the performance. However, the speedup increases 
rapidly with increasing H matrix size and in the best case, i.e. 32 3 , 32GPU is about over than 
5 times faster than 192 CPU cores. If we define the speedup factor as the equivalent number of 
CPU cores comparing with one GPU. For Hamiltonian with size of 32 3 , this factor is about 30 
(192/32 * 5). 

5.4- Performance Scaling for Increasing Number of Nodes 

The performance scaling for increasing number of nodes is also very important reference for 
implementing the algorithm on a large cluster or even supercomputer. Here 8000 trail flips 
of a 20 3 cubic lattice Hamiltonian is calculated on our cluster that has 2 Tesla C2050 and 12 
Xeon 2.4GHz cores in each node. The performance evaluation is shown in Figure [71 As there is 
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Figure 7. Performance scaling for increasing number of nodes, each node has 2 Tesla C2050 
and 12 Xeon 2.4GHz cores. 8000 Monte Carlo trail flips of a 20 3 cubic lattice Hamiltonian are 
calculated with the Chebyshev truncation number kept as 256. The blue/green lines represent 
time consumption of CPU cores/GPUs 

very little communication among MPI threads, it can be noticed that for both CPU and GPU 
implementation, the time consumption always decreases by half when the number of nodes is 
doubled , suggesting an ideal performance scaling with number of nodes. 

5.5. Performance Considerations 

As shown in Figure the bottleneck of this algorithm is the memory bandwidth, GPU could 
solve the memory bandwidth limitation very well and shows high speedup ratio. Due to the 
nature that there is little communication among MPI threads, this algorithm is also capable to 
run on many compute nodes while keeping good parallelization efficiency 

However, due to the thermal equilibrium problem explained in Section 13.21 the of number of 
MPI threads is limited, in this case, a combination of parallelization techniques such as MPI, 
OpenMP and CUDA should be introduced. For example, the workload is firstly distributed 
into different nodes using MPI, and then in each node we use OpenMP to calculate the MVM, 
if the CUDA is capable, the OpenMP thread may employ GPU to calculate the MVM. The 
combination can increase the parallelism and gain high performance. However, it should be 
pointed out that the process to combine these techniques may not be very straightforward since 
we have to take trade-offs in many occasions. 

6. Conclusion and Future Work 

In this paper, we have provided the detailed formulation of the GFMC method for the DE model. 
We have proposed implementation methods to parallelize the GFMC method using MPI on CPU 
as well as CUDA on GPU. In order to eliminate the data transfer between CPU and GPU as 
much as possible, the program is implemented on GPU using several kernel functions. The 
performance evaluation indicates that the GPU implementation could effectively overcome the 
CPU memory bandwidth limitation and shows more than 30 speedup factor when Hamiltonian 



size is 32 . The performance scaling test for increasing number of processors indicates that the 
MPI parallelization is very effective for this algorithm. Finally we discussed more considerations 
on the performance for future optimizations. 

For the future work, we will implement this method as a GPU based library that can be used 
for other lattice models. 
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